Introduction

In practice, it is easier to construct hashing algorithms which operate on relatively small, fixed input lengths, whilst still keeping the output length even smaller ( is still less than ). But hash functions are usually used on much larger inputs - for example, creating checksums for integrity verification of files. The Merkle-Damgård transform allows us to turn such a hash function, which operates on small fixed input lengths, into a hash function which operates on inputs of arbitrary lengths.

The Merkle-Damgård Construction

In particular, given a compression function which works with inputs of a "small", fixed input length and has outputs with length , the Merkle-Damgård transform allows us to use to a construct a hash function which takes messages of arbitrary length, denoted by , and produces digests of the same output length as .

The construction is similar to a block cipher in the sense that the message is chopped up into blocks. In contrast to block ciphers, however, this is done rather differently. Each block has length (since each block will be input into ), but it is not comprised entirely of message bits. Instead, each block contains message bits and the other bits () represent the so-called chaining variable for the current block.

This means that the message needs to be chopped up into message fragments , all of length . If the message length is not a multiple of , then the message is padded by appending a 1 to it and then appending 0s until the message length is short of a multiple of the fragment length by exactly the number of bits needed to encode the message length . The total length of the padding (including the 1, the 0s and the encoding of the message length) is denoted by .

When the message length is a multiple of the fragment length , padding still needs to be added. In a particular, an additional padding block is appended to the message, following the exact same procedure as before. The padding block begins with a and is followed by s - the last bits of the padding block again encode the message length .

Note

The number of bits reserved for encoding the message length is fixed for a given Merkle-Damgård construction. Usually is 64 bits, resulting in a maximum message length of , which is quite a reasonable limit.

After padding, the actual hash algorithm begins by appending an initialisation vector (IV) of length to the first message fragment . The IV is always a constant which is pre-defined in the specification of the Merkle-Damgård construction.

Example

The SHA256 hash function uses the following 256-bit IV (the value is in hex):

This initialisation vector serves as the initial chaining variable . The concatenation of the first message block with the IV is passed to the compression function , whose output becomes the next chaining variable. In general, the -th iteration takes the -th message block and appends to it the chaining variable . The chaining variable for the current stage is simply the output of from the previous iteration. The final output, i.e. the hash generated by the Merkle-Damgård function , is the final chaining variable.

Security of Merkle-Damgård Constructions

The reason why the Merkle-Damgård transform is used ubiquitously is the fact that it preserves collision resistance.

Theorem: Merkle-Damgård Collision Resistance

If the compression function is collision resistant, then so is the Merkle-Damgård function .

Proof: Merkle-Damgård Collision Resistance

Suppose, towards contradiction that there is an efficient collision finder which can find a collision in with non-negligible probability. Let and be two inputs of lengths and , respectively, such that . Let be the blocks which is divided into, and, let be the blocks which is divided into. Similarly, let and be the chaining variables used at each iteration of the hashing of and , respectively (remember that the chaining variables and are also the output of ).

Case 1: If the two inputs have different lengths, i.e. , then the hash is and the hash is . However, means that which is a contradiction because and so (remember that the length is appended to the message when padding) - we have found two different inputs which cause a collision in the collision resistant .

Case 2: If the two inputs have the same length, i.e. , then they are also divided into the same number of blocks . Let denote the -th input passed to when computing , and let denote the -th input passed to when computing . Additionally, we will denote the output of as , and we will denote the output of as .

Now, and so . This can only happen if or if is a collision pair for and the same logic propagates backwards - in general, can be true only if or if is a collision pair for . The inputs are a collision pair for which means that and so there must be some index for which which means for sure that and so turn out to be a collision in , which is a contradiction.